ABSTRACT

Multi-way Theta-join queries are powerful in describing com- 
plex relations and therefore widely employed in real prac- 
tices. However, existing solutions from traditional distribut- 
ed and parallel databases for multi-way Theta-join queries 
cannot be easily extended to fit a shared-nothing distributed 
computing paradigm, which is proven to be able to sup- 
port OLAP applications over immense data volumes. In 
this work, we study the problem of efficient processing of 
multi-way Theta-join queries using MapReduce from a cost- 
effective perspective. Although there have been some works 
using the (key, value) pair-based programming model to sup- 
port join operations, efficient processing of multi-way Theta- 
join queries has never been fully explored. The substantial 
challenge lies in, given a number of processing units (that 
can run Map or Reduce tasks), mapping a multi-way Theta- 
join query to a number of MapReduce jobs and having them 
executed in a well scheduled sequence, such that the total 
processing time span is minimized. Our solution mainly in- 
cludes two parts: 1) cost metrics for both single MapReduce 
job and a number of MapReduce jobs executed in a certain 
order; 2) the efficient execution of a chain-typed Theta-join 
with only one MapReduce job. Compared with the query
evaluation strategy proposed in [23] and the widely adopted
Pig Latin and Hive SQL solutions, our method significantly
improves join processing efficiency.

1. INTRODUCTION 

Data analytical queries in real practice commonly involve
multi-way join operations. The operators involved in a
multi-way join query are more than just Equi-join. Instead,
the join condition can be defined as a binary function $\theta$
that belongs to $\{<, \le, =, \ge, >, <>\}$, also known as Theta-join.
Compared with Equi-join, it is more general and expressive in
relation description and surprisingly handy in data analytic
queries. Thus, efficient processing of multi-way Theta-join
queries plays a critical role in system performance. In
fact, evaluating multi-way Theta-joins has always been a
challenging problem along with the development of database
technology. Early works, such as [8], [26] and [22], have
elaborated the complexity of the problem and presented their
evaluation strategies. However, their solutions do not scale
to multi-way Theta-joins over data of tremendous volume.
For instance, as reported by Facebook [5] and Google [11],
the underlying data volume is hundreds of terabytes or even
petabytes. In such scenarios, solutions from traditional
distributed or parallel databases are infeasible due to
unsatisfactory scalability and poor fault tolerance.

In contrast, the (key, value)-based MapReduce programming
model offers great scalability and strong fault tolerance.
It has emerged as the most popular processing paradigm in a
shared-nothing computing environment. Recently, research
efforts have been devoted to efficient and effective analytic
processing over immense data within the MapReduce framework.
Currently, the database community mainly focuses on two
issues. First, the transformation from a certain relational
algebra operator, like similarity join, to its (key, value)-based
parallel implementation. Second, the tuning or re-design
of the transformation function such that the MapReduce
job is executed more efficiently in terms of time cost or
computing resource consumption. Although various relational
operators, like pair-wise Theta-join, fuzzy join and
aggregation operators, have been evaluated and implemented
using MapReduce, there is little effort exploring the
efficient processing of multi-way join queries, especially
the more general Theta-join, using MapReduce.
The reason is that the problem involves more than just a
relational operator to (key, value) pair transformation and
its tuning; other critical issues need to be addressed:
1) How many MapReduce jobs should we employ to evaluate
the query? 2) What is each MapReduce job responsible for?
3) How should multiple MapReduce jobs be scheduled?

To address the problem, two challenging issues need to be
resolved. First, the number of available computing units is
in fact limited, which is often neglected when
mapping a task to a set of MapReduce jobs. Although
the pay-as-you-go policy of a Cloud computing platform could
promise as many computing resources as required,
once a computing environment is established, the allowed
maximum number of concurrent Map and Reduce tasks is
fixed according to the system configuration. Even taking
the auto scaling feature of the Amazon EC2 platform [18] into
consideration, the maximum number of involved computing
units is pre-determined by the user-defined profiles.
Therefore, with the user specified Reduce task number, a
multi-way Theta-join query is processed with only a limited
number of available computing units.

The second challenge is that the decomposition of a multi-way
Theta-join query into a number of MapReduce tasks is
non-trivial. Work [28] targets multi-way Equi-join
processing. It decomposes a query into several MapReduce
jobs and schedules the execution based on a specific cost
model. However, it only considers the pair-wise join as the
basic scheduling unit. In other words, it follows the
traditional multi-way join processing methodology, which
evaluates the query with a sequence of pair-wise joins. This
methodology excludes the possible optimization opportunity
to evaluate a multi-way join in one MapReduce job. Our
observation is that, under certain conditions, evaluating a
multi-way join with one MapReduce job is much more
efficient than with a sequence of MapReduce jobs conducting
pair-wise joins. Work [23] reports the same observation. One
dominating reason is that the I/O costs of intermediate
results generated by multiple MapReduce jobs may become
unacceptable overheads. Work [2] presents a solution for
evaluating a multi-way join in one MapReduce job, which
only works for the Equi-join case. Since a Theta-join cannot
be answered by simply making the join attribute the
partition key, the solution proposed in [2] cannot be
extended to solve the case of multi-way Theta-joins. Work [25]
demonstrates effective pair-wise Theta-join processing using
MapReduce by partitioning a two dimensional result space
formed by the cross-product of two relations. For the case
of multi-way join, the result space is a hyper-cube, whose
dimensionality is the number of relations involved in
the query. Unfortunately, work [25] does not explore how
to extend its solution to handle the partition in high
dimensions. Moreover, the question of whether we should
evaluate a complex query with a single MapReduce job or
several MapReduce jobs is not clear yet. Therefore, there
is no straightforward way to combine the techniques in the
existing literature to evaluate a multi-way Theta-join query.

Meanwhile, assume a set of MapReduce jobs are gener- 
ated for the query evaluation. Then given a limited number 
of processing units, it remains a challenge to schedule the 
execution of MapReduce jobs, such that the query can be 
answered with the minimum time span. These jobs may have 
dependency relationships and inter-competition for resource 
consumptions during the concurrent execution. Currently, 
the MapReduce framework requires the number of Reduce 
tasks as a user specified input. Thus, after decomposing a 
multi-way Theta-join query into a number of MapReduce 
jobs, one challenging issue is how to specify each job a 
proper Reduce task number, such that the overall scheduling 
achieves the minimum execution time span. 

Specifically, the problem that we are working on is: given
a number of processing units (that can run Map or Reduce
tasks), mapping a multi-way Theta-join to a number of
MapReduce jobs and having them executed in a well scheduled
order, such that the total processing time span is
minimized. Our solution to this challenging problem includes two
core techniques. The first one is that, given a multi-way
Theta-join query, we examine all the possible decomposition plans
and estimate the minimum execution time cost for each plan.
In particular, we figure out the rules to properly decompose the
original multi-way Theta-join query and study the most
efficient solution to evaluate multiple join condition functions
using one MapReduce job. The second technique is that,
given a limited number of computing units and a pool of
possible MapReduce jobs to evaluate the query, we design a
novel solution to select jobs to evaluate the query
as fast as possible. To evaluate the cost, we develop an I/O
and network aware cost model to describe the behavior of a
MapReduce job.

To the best of our knowledge, this is the first work explor- 
ing the multi-way Theta-joins evaluation using MapReduce. 
Our main contributions are listed as follows: 

• We establish the rules to decompose a multi-way join 
query. Under our proposed cost model, we can figure 
out whether a multi-way join query should be evalu- 
ated with multiple MapReduce jobs or a single MapRe- 
duce job. 

• We develop a resource aware (key, value) pair distri- 
bution method to evaluate the chain-typed multi-way 
Theta-join query with one MapReduce job, which guar- 
antees minimized volume of data copying over the net- 
work, as well as evenly distributed workload among 
Reduce tasks. 

• We validate our cost model and the solution for multi- 
way Theta-join queries with extensive experiments. 

The rest of the paper is organized as follows. In Section 2,
we briefly review the MapReduce computing paradigm and
elaborate the application scenario for multi-way Theta-joins.
We formally define our problem in Section 3 and present
our cost model in Section 4. Section 5 explains
our query evaluation strategies in detail. We validate our
solution in Section 6 with extensive experiments on both real
and synthetic data sets. We summarize and compare the
most recent closely related work in Section 7 and conclude
our work in Section 8.

2. PRELIMINARIES 

In this section we briefly present the MapReduce program- 
ming model and how it has been applied to evaluate join 
queries. More importantly, we elaborate the difficulties and 
limitations of current solutions to solve the multi-way Theta- 
joins with a concrete example. 

2.1 MapReduce & Join Processing 

MapReduce provides a simple parallel programming model 
for data-intensive applications in a shared-nothing environ- 
ment [12]. It was originally developed for indexing crawled 
websites and OLAP applications. Generally, a Master node 
invokes Map tasks on computing nodes that possess the 
input data, which guarantees the locality of computation. 
Map tasks transform the input (key, value) pair $(k_1, v_1)$ to $n$
new pairs: $(k^2_1, v^2_1), \ldots, (k^2_n, v^2_n)$. The output of Map
tasks is then partitioned by the default hashing to different
Reduce tasks according to $k^2_i$. Once the Reduce tasks
receive (key, value) pairs grouped by $k^2_i$, they perform the
user specified computation on all the values of each key, and
write results back to the storage.

Obviously, this (key, value)-based programming model
implies a natural implementation of Equi-join. By making the
join attribute the key, records that can be joined together
are sent to the same Reduce task. Even for the similarity
join case [27], as long as the similarity metric is defined,
each data record is assigned a key set $\mathcal{K} = \{k_1, \ldots, k_j\}$,
and the intersection of similar data records' key sets is never
empty. Thus, such a mapping guarantees that
similar data records will be sent to at least one common
Reduce task.
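
The Equi-join mapping described above can be sketched in a few lines of Python. This is only an in-memory simulation of the Map, shuffle and Reduce steps, not Hadoop code; the function names, the "R"/"S" tags and the tuple layout of records are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def map_phase(records, tag, key_index):
    """Emit (join_key, (tag, record)) so the join attribute becomes the key."""
    for rec in records:
        yield rec[key_index], (tag, rec)


def shuffle(pairs, num_reducers):
    """Hash-partition the pairs to reducers and group values by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[hash(key) % num_reducers][key].append(value)
    return reducers


def reduce_phase(grouped):
    """Within each key group, combine every R record with every S record."""
    for values in grouped.values():
        left = [rec for tag, rec in values if tag == "R"]
        right = [rec for tag, rec in values if tag == "S"]
        for l, r in product(left, right):
            yield l + r


def equi_join(R, S, r_key, s_key, num_reducers=4):
    """Simulated single-job Equi-join of two lists of tuples."""
    pairs = list(map_phase(R, "R", r_key)) + list(map_phase(S, "S", s_key))
    out = []
    for grouped in shuffle(pairs, num_reducers):
        out.extend(reduce_phase(grouped))
    return out
```

Because records with equal join attribute values hash to the same reducer, each record is copied exactly once, which is the property that is lost for general Theta-joins.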

In fact, this key set method can be applied to any type of
join operator. However, to ensure that joinable data records
are always assigned overlapping key sets, the cardinality
of a data record's $\mathcal{K}$ can be very large. In the worst case,
it is the total number of Reduce tasks. Since the cardinality
of a record's $\mathcal{K}$ is the number of times this record
is duplicated among Reduce tasks, the larger the value
is, the more computing overhead in terms of I/O and CPU
consumption will be introduced. Therefore, the essential
optimization goal is to find "the optimal" assignment of $\mathcal{K}$ to
each data record, such that the join query can be evaluated
with minimized data transmission over the network.
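
To make the duplication cost concrete, the following minimal sketch evaluates a pair-wise Theta-join with the worst-case key set assignment mentioned above: every record of one relation is replicated to all Reduce tasks. The assignment rule, the function names and the `copies` counter are illustrative assumptions and do not represent an optimized assignment of $\mathcal{K}$.

```python
def assign_key_sets(R, S, num_reducers):
    """Worst-case assignment: each R record gets one key,
    each S record gets every key (|K| = num_reducers)."""
    r_keys = {i: {i % num_reducers} for i in range(len(R))}
    s_keys = {j: set(range(num_reducers)) for j in range(len(S))}
    return r_keys, s_keys


def theta_join(R, S, theta, num_reducers=3):
    """Simulate one job; return (join result, number of record copies sent)."""
    r_keys, s_keys = assign_key_sets(R, S, num_reducers)
    buckets = [([], []) for _ in range(num_reducers)]
    copies = 0
    for i, r in enumerate(R):
        for k in r_keys[i]:
            buckets[k][0].append(r)
            copies += 1
    for j, s in enumerate(S):
        for k in s_keys[j]:
            buckets[k][1].append(s)
            copies += 1
    results = []
    for left, right in buckets:
        # Each (r, s) pair meets in exactly one bucket, so no duplicates arise.
        results.extend((r, s) for r in left for s in right if theta(r, s))
    return results, copies
```

With 3 records in R, 2 records in S and 3 Reduce tasks, the sketch ships 3 + 2 x 3 = 9 record copies, compared with 5 for an Equi-join partitioned on the join attribute.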

Another common concern about the MapReduce program- 
ming model is its poor immunity to key skews. If (key, value) 
pairs are highly unevenly distributed among Reduce tasks, 
the system throughput can degrade significantly. Unfortu- 
nately, this could be a common scenario in join operations. 
If there exist "popular" join attribute values, or the join con- 
dition is an inequality, some data records can be joined with 
huge number of data records from other relations, which 
implies significant key skew among the Reduce tasks. Moreover,
the fault tolerance property of the MapReduce programming
model is guaranteed at the cost of saving all the
intermediate results. Thus, the overhead of disk I/O
dominates the time efficiency of iterative MapReduce jobs. The
same observation has been made in [28].

In summary, efficiently processing join operations using
MapReduce is non-trivial. Especially when it comes to multi-way
join processing, selecting proper MapReduce jobs and
deciding a proper $\mathcal{K}$ for each data record make the problem
more challenging.

2.2 Multi-way Theta-Join 

Theta-join is the join operation that takes inequality
conditions of join attributes' values into consideration, namely
the join condition function $\theta \in \{<, >, =, <>, \le, \ge\}$. Multi-way
Theta-join is a powerful analytic tool to elaborate complex
data correlations. Consider the following application
scenario:

"Assume we have $n$ cities, $\{c_1, c_2, \ldots, c_n\}$, and all the
flights information $FI_{i,j}$ between any two cities $c_i$ and $c_j$.
Given a sequence of cities $\langle c_s, \ldots, c_t \rangle$, and the stay-over
time length which must fall in the interval $L_i = [l_1, l_2]$ at
each city $c_i$, find out all the possible travel plans."

This is a practical query that could help travelers plan
their trips. For illustration purposes, we simply assume $FI_{i,j}$
is a table containing flight No., departure time ($dt$) and
arrival time ($at$). Then the above request can be easily
answered with a multi-way Theta-join operation over $FI_{s,s+1}, \ldots,
FI_{t-1,t}$, by specifying that the time interval between two
successive flights falls into the particular city's stay-over
interval requirement. For example, the $\theta$ function between
$FI_{s,s+1}$ and $FI_{s+1,s+2}$ is
$FI_{s,s+1}.at + L_{s+1}.l_1 < FI_{s+1,s+2}.dt < FI_{s,s+1}.at + L_{s+1}.l_2$.

To evaluate such queries, a straightforward method is to
iteratively conduct pair-wise Theta-joins. However, this
evaluation strategy might exclude some more efficient evaluation
plans. For instance, instead of using pair-wise joins, we can
evaluate multiple join conditions in one task. Therefore, fewer
MapReduce jobs are needed, which implies less computation
overhead in terms of the disk I/O of intermediate results.
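
As a small illustration of the straightforward strategy, the sketch below answers the travel plan query by iterating pair-wise Theta-joins over consecutive flight tables. It runs in memory rather than on MapReduce, and the `Flight` type, the function names and the sample times are assumptions introduced for the example only.

```python
from typing import List, NamedTuple, Tuple


class Flight(NamedTuple):
    no: str     # flight No.
    dt: float   # departure time
    at: float   # arrival time


def stay_over_ok(prev: Flight, nxt: Flight, interval: Tuple[float, float]) -> bool:
    """Theta condition: prev.at + l1 < nxt.dt < prev.at + l2."""
    l1, l2 = interval
    return prev.at + l1 < nxt.dt < prev.at + l2


def travel_plans(legs: List[List[Flight]],
                 intervals: List[Tuple[float, float]]) -> List[List[Flight]]:
    """legs[i] holds the flights of the i-th hop; intervals[i] is the stay-over
    interval at the city between hop i and hop i + 1."""
    plans = [[f] for f in legs[0]]
    for leg, interval in zip(legs[1:], intervals):
        # One pair-wise Theta-join per iteration; `plans` is the
        # intermediate result that a chain of MRJs would write to disk.
        plans = [p + [f] for p in plans for f in leg
                 if stay_over_ok(p[-1], f, interval)]
    return plans
```

Every loop iteration materializes an intermediate result, which is exactly the overhead that evaluating several join conditions in one job tries to avoid.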

3. PROBLEM DEFINITION 

In this work, we mainly focus on the efficient processing of
multi-way Theta-joins using MapReduce. Our solution targets
MapReduce job identification and scheduling.
In other words, we work on the rules to properly decompose
the query processing into several MapReduce jobs and
have them executed in a well scheduled fashion, such that
the minimum evaluation time span is achieved. In this
section, we first present the terminology that we use in
this paper, and then give the formal definition of the
problem. We show that the problem of finding the optimal query
evaluation plan is NP-hard.

3.1 Terminology and Statement 

For ease of presentation, in the rest of the paper we
use the notation "N-join" query to denote a multi-way
Theta-join query. We use MRJ to denote a MapReduce job.

Consider an N-join query $Q$ defined over $m$ relations $\mathcal{R}_1, \ldots,
\mathcal{R}_m$ and $n$ specified join conditions $\theta_1, \ldots, \theta_n$. As adopted
in many other works, like [28], we can present $Q$ as a
graph, namely a join graph. For completeness, we define a
join graph $\mathcal{G}_J$ as follows:

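Definition 1 A join graph $\mathcal{G}_J = (V, E, L)$ is a connected graph
with edge labels, where $V = \{v \mid v \in \{\mathcal{R}_1, \ldots, \mathcal{R}_m\}\}$,
$E = \{e \mid e = (v_i, v_j),\ \exists\theta:\ \mathcal{R}_i \bowtie_\theta \mathcal{R}_j \in Q\}$, $L = \{l \mid l(e_i) = \theta_i\}$.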

Intuitively, $\mathcal{G}_J$ is generated by making every relation in $Q$ a
vertex and connecting two vertices if there is a join operator
between them. The edge is labeled with the corresponding
join function $\theta$. To evaluate $Q$, every $\theta$ function, i.e., every
edge from $\mathcal{G}_J$, needs to be evaluated. However, to evaluate all
the edges in $\mathcal{G}_J$, there is an exponential number of plans since
any arbitrary number of connecting edges can be evaluated
in one MRJ. We propose a join-path graph to cover all the
possibilities. For clarity, we first define
a no-edge-repeating path between two vertices of $\mathcal{G}_J$.

Definition 2 A no-edge-repeating path $p$ between two vertices
$v_i$ and $v_j$ in $\mathcal{G}_J$ is a traversing sequence of connecting
edges $(e_i, \ldots, e_j)$ between $v_i$ and $v_j$ in $\mathcal{G}_J$, in which no edge
appears more than once.

Definition 3 A join-path graph $\mathcal{G}_{JP} = (V, E', L', W, S)$ is a
complete weighted graph with edge labels, where each edge is
associated with a weight and scheduling information.
Specifically, $V = \{v \mid v \in \{\mathcal{R}_1, \ldots, \mathcal{R}_m\}\}$, $E' = \{e' \mid e' = (v_i, v_j)$
represents a unique no-edge-repeating path $p$ between $v_i$ and $v_j$
in $\mathcal{G}_J\}$, $L' = \{l' \mid l'(e') = l'(v_i, v_j) = \bigcup l(e),\ e \in p$ between $v_i$
and $v_j\}$, $W = \{w \mid w(e')$ is the minimal cost to evaluate $e'\}$,
$S = \{s \mid s(e')$ is the scheduling to evaluate $e'$ at the cost of
$w(e')\}$.

In the definition, the scheduling information on the edge
refers to a user specified parameter to run an MRJ, such
that this job is expected to be accomplished as fast as
possible. In this work, we consider the number of Reduce tasks
assigned to an MRJ as the scheduling parameter, denoted
as $RN(MRJ)$, as it is the only parameter that users need
to specify in their programs. The reason we take this
parameter into consideration is based on two observations from
extensive experiments: 1) It is not guaranteed that the more
computing units are involved in Reduce tasks, the sooner an MRJ
is accomplished; 2) Given limited computing units, there
is resource competition among multiple MRJs.

Intuitively, we enumerate all the possible join combinations
in $\mathcal{G}_{JP}$. Note that in the context of join processing,
$\mathcal{R}_i \bowtie \mathcal{R}_k \bowtie \mathcal{R}_j$ is the same as $\mathcal{R}_j \bowtie \mathcal{R}_k \bowtie \mathcal{R}_i$; therefore,
$\mathcal{G}_{JP}$ is an undirected graph. We elaborate Definition 3 with
the following example. Given a join graph $\mathcal{G}_J$, shown on the
left in Fig. 1, a corresponding join-path graph $\mathcal{G}_{JP}$ is
generated, which is presented in adjacency matrix format on the
right. The numbers enclosed in braces are the involved $\theta$
functions on a path. For instance, in the cell corresponding
to $R_1$ and $R_2$, {3, 4, 6, 5, 2} indicates a no-edge-repeating
path $\{\theta_3, \theta_4, \theta_6, \theta_5, \theta_2\}$ between $R_1$ and $R_2$. For this
particular example, notice that for every node there exists a
closed traversing path (or circuit) which covers all the edges
exactly once, namely the "Eulerian Circuit". We use $\mathcal{E}(\mathcal{G}_{JP})$
to denote an "Eulerian Circuit" of $\mathcal{G}_{JP}$ in the figure. Since
we only care which edges are involved in a path, any $\mathcal{E}(\mathcal{G}_{JP})$
would be sufficient. Notice that in the figure, edge weights
and scheduling information are not presented. As a matter
of fact, this information is incrementally computed during
the generation of $\mathcal{G}_{JP}$, which will be illustrated in a
later section.

Figure 1: Example join graph $\mathcal{G}_J$ and its corresponding
join-path graph $\mathcal{G}_{JP}$, presented as an adjacency
matrix
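
The enumeration behind the join-path graph can be sketched as a depth-first search over no-edge-repeating paths. The code below is a brute-force illustration only: it records the label sets $L'$ but not the weights $W$ or the scheduling information $S$, its running time is exponential in the number of edges, and the input format (a dict from condition label to vertex pair) and function name are assumptions.

```python
from collections import defaultdict


def join_path_labels(edges):
    """edges: dict mapping a join-condition label to its vertex pair (u, v).
    Returns {(u, v): set of frozensets}, where each frozenset is the label
    set of some no-edge-repeating path between u and v. Paths that use the
    same edges in a different order collapse into one entry."""
    adj = defaultdict(list)
    for label, (u, v) in edges.items():
        adj[u].append((label, v))
        adj[v].append((label, u))
    paths = defaultdict(set)

    def extend(start, node, used):
        for label, nxt in adj[node]:
            if label in used:
                continue  # never reuse an edge on the same path
            new_used = used | {label}
            # Undirected graph: store each vertex pair in sorted order.
            paths[tuple(sorted((start, nxt)))].add(frozenset(new_used))
            extend(start, nxt, new_used)

    for v in list(adj):
        extend(v, v, frozenset())
    return paths
```

For a triangle over R1, R2 and R3 with conditions 1, 2 and 3, the entry for (R1, R2) contains {1} and {2, 3}, and the diagonal entry (R1, R1) contains the circuit {1, 2, 3}.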

According to the definition of $\mathcal{G}_{JP}$, any edge $e'$ in $\mathcal{G}_{JP}$ is a
collection of connecting edges in $\mathcal{G}_J$. Thus, $e'$ in fact implies
a subgraph of $\mathcal{G}_J$. As we use one MRJ to evaluate $e'$, denoted
as $MRJ(e')$, $\mathcal{G}_{JP}$'s edge set represents all the possible MRJs
that can be employed to evaluate the original query $Q$. Let
$\mathcal{T}$ denote a set of MRJs that are selected from $\mathcal{G}_{JP}$'s edge set.
Intuitively, if the MRJs in $\mathcal{T}$ cover all the join conditions of
the original query, we can answer the query by executing all
these MRJs. Formally, we define that $\mathcal{T}$ is "sufficient" as
follows:

Definition 4 $\mathcal{T}$, a collection of MRJs, is sufficient to evaluate
$Q$ iff $\bigcup e'_i = \mathcal{G}_J.E$, where $MRJ(e'_i) \in \mathcal{T}$.

Since it is trivial to check whether $\mathcal{T}$ is sufficient, for the
rest of this work, we only consider the case that $\mathcal{T}$ is
sufficient. Thus, given $\mathcal{T}$, we define its execution plan $\mathcal{P}$ as a
specific execution sequence of MRJs, which minimizes the
time span of using $\mathcal{T}$ to evaluate the original query $Q$.
Formally, we can define our problem as follows:
Problem Definition: Given an N-join query $Q$ and $k_P$
processing units, a join-path graph $\mathcal{G}_{JP}$ according to $Q$'s join
graph $\mathcal{G}_J$ is built. We want to select a collection of edges
from $\mathcal{G}_{JP}$ that correspondingly form a set of MRJs, denoted
as $\mathcal{T}_{opt}$, such that there exists an execution plan $\mathcal{P}$ of $\mathcal{T}_{opt}$
which minimizes the query evaluation time.

Obviously, there are many different choices of $\mathcal{T}$ to
evaluate $Q$. Moreover, given $\mathcal{T}$ and limited processing units,
different execution plans yield different evaluation time spans.
In fact, the determination of $\mathcal{P}$ is non-trivial; we give the
detailed analysis of the hardness of our problem in the next
subsection. As we shall elaborate later, given $\mathcal{T}$ and $k_P$
available processing units, we adopt an approximation method to
determine $\mathcal{P}$ in linear time.

3.2 Problem Hardness 

According to the problem definition, we need two steps to
find $\mathcal{T}_{opt}$: 1) generate $\mathcal{G}_{JP}$ from $\mathcal{G}_J$; 2) select MRJs for $\mathcal{T}_{opt}$.
Neither of these two steps is easy to solve.

For the first step, to construct $\mathcal{G}_{JP}$, we need to enumerate
all the no-edge-repeating paths between any pair of vertices
in $\mathcal{G}_J$. Assume $\mathcal{G}_J$ has an "Eulerian trail" [16], which is a
way to traverse the graph with every edge visited exactly
once. Then for any pair of vertices $v_i$ and $v_j$, any
no-edge-repeating path between them is a "sub-path" of an
Eulerian trail. If we know all the no-edge-repeating paths
between any pair of vertices, we can enumerate all the
Eulerian trails in polynomial time. Therefore, constructing
$\mathcal{G}_{JP}$ is at least as hard as enumerating all the
Eulerian trails of a given graph, which is known to be
#P-complete [6]. Moreover, we find that even if $\mathcal{G}_J$ does not have
an Eulerian trail, the problem complexity is not reduced at
all, as we elaborate in the proof of the following theorem.

Theorem 1 Generating $\mathcal{G}_{JP}$ from a given $\mathcal{G}_J$ is a #P-complete
problem.

Proof. If $\mathcal{G}_J$ has an Eulerian trail, constructing $\mathcal{G}_{JP}$ is
#P-complete (see the discussion above).

On the contrary, if $\mathcal{G}_J$ does not have an Eulerian trail, it
implies that there are $r$ vertices having odd degrees, where
$r > 2$. Now consider that we add one virtual vertex and connect
it with $r-1$ vertices of odd degrees. Now the graph
must have an Eulerian trail. If we can easily construct the
join-path graph of the new graph, the original graph's $\mathcal{G}_{JP}$
can be computed in polynomial time. We elaborate with the
following example, as shown in Fig. 2. Assume $v_s$ is added
to the original $\mathcal{G}_J$; then by computing the join-path graph
of the new graph, we know all the no-edge-repeating paths
between $v_i$ and $v_j$. A no-edge-repeating path between
$v_i$ and $v_j$ that involves $v_s$ cannot exist in the original
$\mathcal{G}_J$. By simply removing all the enumerated paths that go
through $v_s$, we can obtain the $\mathcal{G}_{JP}$ of the original $\mathcal{G}_J$.
Thus, the dominating cost
of constructing $\mathcal{G}_{JP}$ is still the enumeration of all Eulerian
trails. Therefore, this problem is #P-complete. □




Figure 2: Adding virtual vertex $v_s$ to $\mathcal{G}_J$

Although it is difficult to compute the exact $\mathcal{G}_{JP}$, we find
that a subgraph of $\mathcal{G}_{JP}$ which contains all the vertices,
denoted as $\mathcal{G}'_{JP}$, could be sufficient to guarantee the optimal
query evaluation efficiency. We take the following principle
into consideration. Given the same number of processing
units, if it takes longer to evaluate $\mathcal{R}_i \bowtie \mathcal{R}_j \bowtie \mathcal{R}_k$
with one MRJ than the total time cost of evaluating
$\mathcal{R}_i \bowtie \mathcal{R}_j$ and $\mathcal{R}_j \bowtie \mathcal{R}_k$ separately and merging the results,
we do not take $\mathcal{R}_i \bowtie \mathcal{R}_j \bowtie \mathcal{R}_k \bowtie \mathcal{R}_s$ into consideration.
By following this principle, we can avoid enumerating all
the possible no-edge-repeating paths between any pair of
vertices. As a matter of fact, we can obtain such a sufficient
$\mathcal{G}'_{JP}$ in polynomial time.

The second step of our solution is to select $\mathcal{T}_{opt}$. Assume
the $\mathcal{G}_{JP}$ computed from the first step provides a
collection of edges; accordingly, we have a collection of MRJ
candidates to evaluate the query. Although each edge in
$\mathcal{G}_{JP}$ is associated with a weight denoting the minimum time
cost to evaluate all the join conditions contained in this edge,
it is just an estimated time span on the condition that there
are enough processing units. However, when a $\mathcal{T}$ is chosen
and the number of processing units is limited, the time cost
of using $\mathcal{T}$ to answer $Q$ needs to be re-estimated. Assume
we can find the time cost estimation of $\mathcal{T}$, denoted as $C(\mathcal{T})$;
then the problem is to find the optimal $\mathcal{T}_{opt}$ among all
possible $\mathcal{T}$s, which has the minimum time cost. Apparently,
this is a variant of the classic set cover problem, which is
known to be NP-hard [10]. Therefore, many heuristics
and approximation algorithms can be adopted to solve
the selection problem.
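
One of the standard heuristics alluded to above is the greedy algorithm for weighted set cover. The sketch below is that textbook heuristic, not the selection strategy of this paper: it treats the edge weights as fixed and ignores both the limited number of processing units and the scheduling of the selected jobs. The candidate names and weights in the usage note are hypothetical.

```python
def greedy_select(candidates, all_conditions):
    """candidates: dict name -> (set of join conditions covered, weight).
    Repeatedly pick the candidate with the lowest weight per newly
    covered condition until every condition is covered."""
    uncovered = set(all_conditions)
    chosen = []
    while uncovered:
        best, best_ratio = None, None
        for name, (conds, weight) in candidates.items():
            gain = len(conds & uncovered)
            if gain == 0:
                continue
            ratio = weight / gain
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            raise ValueError("candidates do not cover all join conditions")
        chosen.append(best)
        uncovered -= candidates[best][0]
    return chosen
```

With two pair-wise candidates of weight 4 each and one combined candidate of weight 6 covering both conditions, the heuristic picks the combined job, since its cost per covered condition (3) is lower.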

As clearly indicated in the problem definition, the solution
lies in the construction of $\mathcal{G}'_{JP}$ and the smart selection of $\mathcal{T}$ based on
the cost estimation of a group of MRJs. Therefore, for the
rest of the paper, we shall first elaborate our cost models for
a single MRJ and a group of MRJs. Then we present our
detailed solution for the N-join query evaluation.

4. COST MODEL 

To highlight our observations on how much the overlapping
of computation and network cost would affect the
execution of an MRJ, in this section we present a generalized
analytical study on the execution time of both a single MRJ
and a group of MRJs. In the context of $\mathcal{G}_{JP}$ construction
and $\mathcal{T}$ selection, we study the estimation of $w(e')$, where
$e' \in \mathcal{G}_{JP}.E$, and $C(\mathcal{T})$, which is the time cost to evaluate $\mathcal{T}$.

4.1 Estimating w(e'): Model for Single MRJ 

Since our target is to find an optimal join plan, we only 
consider the processing cost of join operations with MRJs. 
Generally, most of the CPU time for join processing is spent 
on simple comparison and counting, thus, system I/O cost 
dominates the total execution time. For MapReduce jobs, 
heavy cost on large scale sequential disk scan and frequent 
I/O of intermediate results dominate the execution time. 
Therefore, we shall build a model for a MRJ's execution 
time based on the analysis of I/O and network cost. 

The general MapReduce computing framework involves three
phases of data processing: Map, Reduce and the data copying
from Map tasks to Reduce tasks, as shown in Fig. 3.
In the figure, each "M" stands for a Map task; each "CP"
stands for one phase of Map output copying over the network,
and each "R" stands for a Reduce task. Since each Map
task is based on a data block, we assume that the unit
processing cost for each Map task is $t_M$. Moreover, since the
entire input data may not be loaded into the system memory
within one round [12] [3], we assume these Map tasks are
performed round by round (we have the same observation in
practice). However, the size of a Reduce task is subject to
the (key, value) distribution. As shown in Fig. 3, the make
span of an MRJ is dominated by the most time consuming
Reduce task. Therefore, we only consider the Reduce task
with the largest volume of inputs in the following analysis.
Assume the total input size of an MRJ is $S_I$, the total
intermediate data copied from Map to Reduce is of size $S_{CP}$, and the
number of Map tasks and Reduce tasks are $m$ and $n$,
respectively. In addition, as a general assumption, $S_I$ is considered
to be evenly partitioned among $m$ Map tasks [24]. Let $J_M$,
$J_R$ and $J_{CP}$ denote the total time cost of the three phases
respectively, and $T$ be the total execution time of an MRJ. Then
$T \le J_M + J_{CP} + J_R$ holds due to the overlapping between
$J_M$ and $J_{CP}$.

Figure 3: MapReduce workflow

Each Map task performs disk I/O and data processing.
Since disk I/O is the dominant cost, we
can estimate the time cost for a single Map task based on disk
I/O. Disk I/O contains two parts: one is sequential reading,
the other is data spilling. Then the time cost for a single Map
task $t_M$ is

$$ t_M = (C_1 + p \times \alpha) \times \frac{S_I}{m} \qquad (1) $$

where $C_1$ is a constant factor regarding disk I/O capability and
$p$ is a random variable denoting the cost of spilling
intermediate data. For a given system configuration, $p$ is subject
to the intermediate data size; it increases while the spilled data
size grows. $\alpha$ denotes the output ratio of a Map task, which
is query specific and can be computed with selectivity
estimation. Assume $m'$ is the current number of Map tasks
running in parallel in the system; then $J_M$ can be computed
as follows:

$$ J_M = t_M \times \frac{m}{m'} \qquad (2) $$

For $J_{CP}$, let $t_{CP}$ be the time cost for copying the output
of a single Map task to $n$ Reduce tasks. It includes the cost
of data copying over the network as well as the overhead of
serving network protocols. $t_{CP}$ is calculated with the following
formula,

$$ t_{CP} = C_2 \times \frac{\alpha \times S_I}{n \times m} + q \times n \qquad (3) $$

where $C_2$ is a constant number denoting the efficiency of
data copying over the network and $q$ is a random variable which
represents the cost of a Map task serving $n$ connections from
$n$ Reduce tasks. Intuitively, there is a rapid growth of $q$ while
$n$ gets larger. Thus, $J_{CP}$ can be computed as follows:

$$ J_{CP} = \frac{m}{m'} \times t_{CP} \qquad (4) $$

For $J_R$, intuitively it is dominated by the Reduce task
which has the biggest size of input. We assume that the key
distribution in the input file is random; thus let $S_r^i$ denote
the input size of Reduce task $i$. Then according to the Central
Limit Theorem [20], we can assume that for $i = 1, \ldots, n$, $S_r^i$ follows
a normal distribution $N(\mu, \sigma)$, where $\mu$ is determined by
$\alpha \times S_I$ and $\sigma$ is subject to data set properties, which can be
learned from history query logs. Thus, by employing the
rule of "three sigmas" [20], we make $S_r^* = \alpha \times S_I \times n^{-1} + 3\sigma$
the biggest input size to a Reduce task; then

$$ J_R = (p + \beta \times C_1) \times S_r^* \qquad (5) $$

where $\beta$ is a query dependent variable denoting the output
ratio, which could be pre-computed based on selectivity
estimation.

Thus, the execution time $T$ of an MRJ is:

$$ T = \begin{cases} J_M + t_{CP} + J_R & \text{if } t_M \ge t_{CP} \\ t_M + J_{CP} + J_R & \text{if } t_M < t_{CP} \end{cases} \qquad (6) $$

In our cost model, parameters $C_1$, $C_2$, $p$ and $q$ are system dependent and need to be derived from observations on the execution of real jobs, which are elaborated in the experiments section. This model favors MRJs whose execution time is dominated by I/O cost. Experiments show that our method can produce a reasonable approximation of the MRJ running time in real practice.
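The cost model of a single MRJ can be sketched in a few lines of Python. This is only an illustration of how Equ. 2 to Equ. 6 fit together: the function name, the argument names and the `CostParams` container are hypothetical, and the parameter values used in practice have to be measured on the target cluster.

```python
from dataclasses import dataclass


@dataclass
class CostParams:
    """System-dependent parameters of the cost model (values are assumptions)."""
    c1: float  # C_1
    c2: float  # C_2, efficiency of data copying over the network
    p: float   # spill-speed variable, treated here as a plain number
    q: float   # per-connection serving cost, treated here as a plain number


def mrj_time(t_m, m, m_par, n, alpha, s_i, beta, sigma, prm):
    """Estimate the execution time T of one MRJ.

    t_m: time of one Map task; m: total Map tasks; m_par: Map tasks running
    in parallel (m'); n: Reduce tasks; alpha * s_i: total Map output size;
    beta: output ratio; sigma: spread of the Reduce input size.
    """
    j_m = t_m * m / m_par                                 # Equ. (2)
    t_cp = prm.c2 * (alpha * s_i) / (n * m) + prm.q * n   # Equ. (3)
    j_cp = m / m_par * t_cp                               # Equ. (4)
    s_star = alpha * s_i / n + 3 * sigma                  # three-sigma bound
    j_r = (prm.p + beta * prm.c1) * s_star                # Equ. (5)
    if t_m >= t_cp:                                       # Equ. (6)
        return j_m + t_cp + j_r
    return t_m + j_cp + j_r
```

The branch in the last lines mirrors the two cases of Equ. 6: when Map tasks are slower than copying, only the last copy is exposed; otherwise only the first Map task is.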

4.2 Estimating c(T): Model for A Group of 
MRJs 

There have been some works exploring the optimization opportunities among multiple MRJs running in parallel, like [23], [24] and [28], by defining multiple types of correlations among MRJs. For instance, [23] defines "input correlation", "transit correlation" and "job flow correlation", targeting shared input scans and intermediate data partitions. In fact, their techniques can be directly plugged into our solution framework. Compared to these techniques, the significant difference of our study on the execution model of a set of MRJs is that our work takes the number of available processing units into consideration. Therefore, the optimization problem we study here is orthogonal to the techniques proposed in the existing literature mentioned above.

Given $T$ and $k_P$ processing units, we are concerned with the execution plan $P$ that guarantees the minimum task execution time span. However, the determination of $P$ is usually subject to $k_P$. For instance, consider the $T$ given in Fig. 4. MRJ($e'_i$), MRJ($e'_j$) and MRJ($e'_k$) can be accomplished in 5, 7 and 9 time units if 4, 4 and 8 Reduce tasks are assigned to them, respectively. Thus, if there are over 16 available processing units, these three MRJs can be scheduled to run in parallel without competing for computing resources. On the contrary, if there are not enough processing units, parallel execution of multiple MRJs can lead to very poor time efficiency. This is exactly the classic problem of scheduling independent malleable parallel tasks over bounded parallel processors, which is NP-hard [19]. In this work, we adopt the methodology presented in [19]. The method guarantees that for any given $\epsilon > 0$, it takes linear time (in terms of $|T|$, $k_P$ and $1/\epsilon$) to compute a schedule whose evaluation time is at most $(1+\epsilon)$ times the optimal one.

Figure 4: One execution plan of $T = \{e'_i, e'_j, e'_k\}$

Moreover, to evaluate $Q$ with $T$, not only must the MRJs in $T$ be executed, but a merge step is also needed to generate the final results. Intuitively, if two MRJs share a common input relation, their outputs can be merged using the common relation as the key. For instance, Fig. 4 presents one possible execution plan of MRJ($e'_i$), MRJ($e'_j$) and MRJ($e'_k$). Assume there are over 16 available processing units; then we execute all three jobs in parallel. Since MRJ($e'_i$) and MRJ($e'_j$) share the same inputs $R_1$ and $R_4$, their outputs can be merged using the primary keys from both $R_1$ and $R_4$. Later on, the output of this step can be further merged with the output of MRJ($e'_k$). The total execution time is 9+2=11 time units. In the figure, we enclose the merge key in brackets. Note that such a merge operation only involves output keys or data IDs; therefore, it can be done very efficiently.
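To make the dependence of the plan on the number of processing units concrete, the following sketch schedules the three MRJs of the running example with a simple shelf heuristic. It is not the approximation scheme of [19]: it treats each MRJ as rigid (a fixed number of units and a fixed duration), whereas the tasks in the text are malleable, and the function name `shelf_schedule` is hypothetical.

```python
def shelf_schedule(jobs, k_p):
    """Greedy shelf scheduling of rigid jobs on k_p processing units.

    jobs: list of (name, units, time). Jobs on the same shelf run in
    parallel; shelves run one after another. Returns (makespan, shelves).
    """
    shelves = []  # each shelf: {"free": units left, "time": duration, "jobs": names}
    # Longest jobs first, so the first job on a shelf fixes the shelf duration.
    for name, units, time in sorted(jobs, key=lambda j: -j[2]):
        if units > k_p:
            raise ValueError(f"job {name} needs more than {k_p} units")
        for shelf in shelves:
            if shelf["free"] >= units:
                shelf["free"] -= units
                shelf["jobs"].append(name)
                break
        else:
            shelves.append({"free": k_p - units, "time": time, "jobs": [name]})
    makespan = sum(shelf["time"] for shelf in shelves)
    return makespan, [shelf["jobs"] for shelf in shelves]


# The three MRJs of the example: (name, Reduce tasks, time units).
EXAMPLE_JOBS = [("e_i", 4, 5), ("e_j", 4, 7), ("e_k", 8, 9)]
```

Under this simplified model, `shelf_schedule(EXAMPLE_JOBS, 16)` runs all three jobs on one shelf, so adding the 2 time units of merging gives the 11 time units quoted above, while smaller values of `k_p` force additional shelves and a longer makespan.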

5. JOIN ALGORITHM 

As discussed in Section 3, the key issues of our solution lie in constructing $G'_{JP}$ and selecting $T$. In Section 4, we presented an analytical study of the execution schedules of a single MRJ and multiple MRJs. However, we have not yet solved the problem of how to compute a multi-way Theta-join in one MRJ. Therefore, in this section, we first present our solution to multi-way Theta-join processing with one MRJ. Then, we elaborate the construction of $G'_{JP}$ and the selection of $T$.

5.1 Multi-way Theta-join Processing with Single MRJ

As discussed in Section 2, unlike Equi-join, we cannot use the join attribute as the hash key to answer a Theta-join in the MapReduce computing framework. Work [25] is the first to explore how to adopt MapReduce to answer a Theta-join query. Essentially, it partitions the cross-product result space into rectangular regions of bounded size, which guarantees output correctness and workload balance among Reduce tasks. However, their partition method does not have a straightforward extension to multi-way Theta-join queries. Inspired by work [25], we believe that it is feasible to conceptually take the cross-product of multiple relations as the starting point and figure out a better partition strategy.

Based on our problem definition, every MRJ candidate for $T$ is a no-edge-repeating path in the join graph $G_J$. Thus, we only consider the case of chain joins. Given a chain Theta-join query with $m$ different relations involved, we want to derive a (key, value)-based solution that guarantees the minimum execution time span. Let $S$ denote the hyper-cube that comprises the cross-product of all $m$ relations. Let $f$ denote a space partition function that maps $S$ to a set of disjoint components whose union is exactly $S$. Intuitively, each component represents a Reduce task, which is responsible for checking whether any valid join result falls into it. Assume there are $k_R$ Reduce tasks, and the cardinality of relation $R$ is denoted as $|R|$. Then each Reduce task has to check $\frac{\prod_{i=1}^{m} |R_i|}{k_R}$ join results. However, it is not true that the more Reduce tasks, the less execution time, because as $k_R$ increases, the volume of data copied over the network may grow significantly. For instance, as shown in Fig. 5, when a Reduce task is added, the network volume increases.

Now we have the two sides of a coin: the number of Reduce tasks $k_R$ and the partition function $f$. Our solution is described as follows. We first define what an "ideal" partition function is; then, we pick one such function and derive a proper $k_R$ for the given chain Theta-join query.
Let $t^j_{R_i}$ denote the $j$-th tuple in relation $R_i$. Partition function $f$ maps $S$ to a set of $k_R$ components, denoted as $C = \{c_1, c_2, \dots, c_{k_R}\}$. Let $Cnt(t^j_{R_i}, C)$ denote the total number of times that $t^j_{R_i}$ appears in all the components. We define the partition score of $f$ as

$$Score(f) = \sum_{i=1}^{m} \sum_{j=1}^{|R_i|} Cnt(t^j_{R_i}, C) \qquad (7)$$
Figure 5: How the network volume increases when more Reduce tasks are involved
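The partition score can be computed directly for a small two-relation example. The sketch below is illustrative only: it represents a partition as a hypothetical function `component_of(i, j)` that returns the component of the cell pairing the $i$-th tuple of the first relation with the $j$-th tuple of the second, and counts how many components each tuple must be copied to.

```python
def partition_score(n1, n2, component_of):
    """Score(f) for a 2-D cross-product of n1 x n2 cells.

    A tuple of the first relation owns one row of cells and must be sent to
    every component that touches that row; likewise for columns and the
    second relation. The score is the total number of tuple copies.
    """
    score = 0
    for i in range(n1):
        score += len({component_of(i, j) for j in range(n2)})
    for j in range(n2):
        score += len({component_of(i, j) for i in range(n1)})
    return score


def row_strips(i, j):
    """Four horizontal strips on a 4 x 4 grid."""
    return i


def square_blocks(i, j):
    """Four 2 x 2 blocks on a 4 x 4 grid."""
    return (i // 2) * 2 + j // 2
```

On a 4 x 4 grid with four components, `partition_score(4, 4, row_strips)` is larger than `partition_score(4, 4, square_blocks)`, because strips force every tuple of the second relation to be copied to all four Reduce tasks.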

Definition 5 Perfect Partition Function. $f$ is a perfect partition function iff for a given $S$ and $\forall k_R$, $Score(f)$ is minimized.

Definition 6 Perfect Partition Class. For a given $S$, the class of all perfect partition functions, $\mathcal{F}$, is the perfect partition class of $S$.

Based on the definition of $\mathcal{F}$, to resolve $\mathcal{F}$ for a given $S$ requires the "Calculus of Variation" [15], which is out of the scope of our current discussion. We shall directly present a partition function $f$ and prove that $f \in \mathcal{F}$.

Theorem 2 To partition a hyper-cube $S$, the Hilbert Space Filling Curve is a perfect partition function $f$.

Proof. The minimum value of the score function defined in Equ. 7 is achieved when the following condition holds:

$$\sum_{u=1}^{|R_i|} Cnt(t^u_{R_i}, C) = \sum_{u=1}^{|R_j|} Cnt(t^u_{R_j}, C) \quad \forall\, 1 \leq i, j \leq n \qquad (8)$$

In other words, in a partition component $c$, assume the number of distinct records from relation $R_i$ is $c(R_i)$; then the duplication factor of $R_i$ in this component must be $\prod_{j=1, j \neq i}^{n} c(R_j)$. Since the Hilbert space filling curve defines a traversing sequence of every cell in the hyper-cube of $R_1 \times \dots \times R_n$, if we use a Hilbert curve $H$ as a partition method, then a component $c$ is actually a continuous segment of $H$. Considering the construction process of $H$, every dimension is recursively divided by a factor of 2, and such recursive computation occurs the same number of times in all dimensions. Note that $H$ defines a traversing sequence that traverses cells along each dimension fairly, meaning that if $H$ has traversed half of $R_i$, then $H$ must also have traversed half of $R_j$, where $R_j$ is any other relation. Thus, given any partition value $k_R$ (equal to the number of Reduce tasks), a segment of $H$ of length $|H| / k_R$ traverses the same proportion of records from each dimension. Let this proportion be $\epsilon$. Therefore, the duplication factor for each record from $R_i$ is



nrt ' 



O) 



where n is the number of recursions. Note that the derived 
duplication factor satisfies the condition given in Equ.8. So 
H is a perfect partition function. □ 
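The effect of cutting a Hilbert curve into contiguous segments can be checked on a small 2-D grid. The sketch below uses the standard distance-to-coordinate conversion for the Hilbert curve; the helper names and the comparison against a row-major traversal are illustrative additions, and the example covers only two relations rather than the general hyper-cube.

```python
def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve to a cell (x, y) of a
    2^order x 2^order grid (standard iterative conversion)."""
    side = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate the current quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_components(order, k_r):
    """Assign each cell to one of k_r contiguous curve segments."""
    total = (1 << order) ** 2
    return {hilbert_d2xy(order, d): d * k_r // total for d in range(total)}


def row_major_components(order, k_r):
    """Baseline: cut a row-by-row traversal into k_r contiguous segments."""
    side = 1 << order
    total = side * side
    return {(x, y): (x * side + y) * k_r // total
            for x in range(side) for y in range(side)}


def grid_score(order, comp_of):
    """Total number of tuple copies implied by a cell-to-component map."""
    side = 1 << order
    rows = sum(len({comp_of[(x, y)] for y in range(side)}) for x in range(side))
    cols = sum(len({comp_of[(x, y)] for x in range(side)}) for y in range(side))
    return rows + cols
```

For an 8 x 8 grid and 4 or 16 components, the Hilbert segments form square blocks and `grid_score` is lower than for the row-major baseline, which matches the intuition of the proof that the curve advances along every dimension at the same rate.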

After obtaining $f$, we can further approximate the value of $k_R$ that achieves the best query evaluation time efficiency. As discussed earlier, $k_R$ affects two parts of our cost model, the network volume and the expected input size to Reduce tasks, both of which are dominating factors of the execution time cost. Therefore, an approximation of the optimal $k_R$ can be obtained by minimizing the following value $\Delta$ (by taking its derivative with respect to $k_R$):

$$\Delta = \lambda \sum_{i=1}^{m} \sum_{j=1}^{|R_i|} Cnt(t^j_{R_i}, C) + (1 - \lambda) \frac{\prod_{i=1}^{m} |R_i|}{k_R} \qquad (10)$$

Intuitively, $\Delta$ is a linear combination of the two cost factors. Coefficient $\lambda$ denotes the importance of each cost factor. For instance, $\lambda < 0.5$ implies that reducing the workload of each Reduce task brings more cost saving.¹ Note that the first cost factor in Equ. 10 is also a linear sum function of $k_R$. Therefore, by setting $\Delta' = 0$, we can obtain $k_R$.
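As a worked illustration of this minimization step, suppose the partition score grows linearly in $k_R$, say $Score(f) \approx c \cdot k_R$ for some slope $c > 0$, and write $W = \prod_{i=1}^{m} |R_i|$ for the size of the cross-product. Both $c$ and $W$ are shorthand introduced here, and taking the second cost term to be the per-task workload $W / k_R$ is an assumption of this sketch:

```latex
\begin{aligned}
\Delta(k_R) &= \lambda\, c\, k_R + (1-\lambda)\,\frac{W}{k_R} \\
\frac{d\Delta}{dk_R} &= \lambda\, c - (1-\lambda)\,\frac{W}{k_R^{2}} = 0
\;\Longrightarrow\;
k_R = \sqrt{\frac{(1-\lambda)\,W}{\lambda\, c}}
\end{aligned}
```

The second derivative $2(1-\lambda)W / k_R^{3}$ is positive for $k_R > 0$, so this stationary point is a minimum. Under these assumptions, $\lambda = 0.4$ would give $k_R = \sqrt{1.5\, W / c}$, which is then rounded to an integer number of Reduce tasks.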

The pseudo code in Alg. 1 describes our solution for evaluating a chain Theta-join query in one MRJ. Note that our main focus is the generation of (key, value) pairs. One tricky method we employ here, which is also employed in work [25], is randomly assigning an observed tuple $t_{R_i}$ a global ID in $R_i$. The reason is that each Map task does not have a global view of the entire relation. Therefore, when a Map task reads a tuple, it cannot tell the exact position of this tuple in the relation.

Algorithm 1: Evaluating a chain Theta-join query in one MRJ

Data: Query $q = R_1 \bowtie \dots \bowtie R_m$, $|R_1|, \dots, |R_m|$;
Result: Query result
Use the Hilbert Space Filling Curve to partition $S$ and compute a proper value of $k_R$
Decide the mapping: GlobalID($t_{R_i}$) → a number of components in $C$
for each Map task do
    GlobalID($t_{R_i}$) ← uniform random selection in $[1, |R_i|]$
    for all components that GlobalID($t_{R_i}$) maps to do
        generate (componentID, $t_{R_i}$)
for each Reduce task do
    for any combination of $t_{R_1}, \dots, t_{R_m}$ do
        if it is a valid result then
            output the result
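The flow of Alg. 1 can be simulated in memory for two relations. The sketch below is an illustration under simplifying assumptions, not the Hadoop implementation: it handles only a 2-D cross-product, uses 0-based global IDs, and introduces the hypothetical function `theta_join_one_mrj`; the ownership test in the Reduce phase is one possible way to make sure each result is produced exactly once.

```python
import random
from collections import defaultdict


def hilbert_d2xy(order, d):
    """Distance d along a Hilbert curve -> cell (x, y) of a 2^order grid."""
    side = 1 << order
    x = y = 0
    t, s = d, 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def theta_join_one_mrj(r1, r2, theta, order, k_r, seed=0):
    """Simulate one MRJ evaluating r1 JOIN r2 under predicate theta."""
    rng = random.Random(seed)
    side = 1 << order
    total = side * side
    # Partition the grid into k_r contiguous Hilbert-curve segments.
    comp_of = {hilbert_d2xy(order, d): d * k_r // total for d in range(total)}
    rows, cols = defaultdict(set), defaultdict(set)
    for (x, y), c in comp_of.items():
        rows[x].add(c)
        cols[y].add(c)

    # Map phase: give each tuple a random global ID, convert it to a grid
    # coordinate, and emit the tuple to every component touching that line.
    buckets = defaultdict(lambda: ([], []))
    for t in r1:
        x = rng.randrange(len(r1)) * side // len(r1)
        for c in rows[x]:
            buckets[c][0].append((x, t))
    for t in r2:
        y = rng.randrange(len(r2)) * side // len(r2)
        for c in cols[y]:
            buckets[c][1].append((y, t))

    # Reduce phase: a component only reports pairs whose cell it owns.
    out = []
    for c, (left, right) in buckets.items():
        for x, a in left:
            for y, b in right:
                if comp_of[(x, y)] == c and theta(a, b):
                    out.append((a, b))
    return out
```

For example, `theta_join_one_mrj(list(range(20)), list(range(0, 30, 3)), lambda a, b: a < b, order=3, k_r=5)` returns the same pairs as a nested-loop join, in a different order.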



5.2 Constructing $G'_{JP}$

By applying Alg. 1, we can minimize the time cost to evaluate a chain Theta-join query using one MRJ. However, usually a group of MRJs is needed to evaluate multi-way Theta-joins. Therefore, we now discuss the construction of $G'_{JP}$, which is a subgraph of the join-path graph $G_{JP}$ and sufficient to serve the evaluation of the N-join query $Q$. As already discussed in Section 3.2, computing $G_{JP}$ is a #P-complete problem, as it requires enumerating all possible no-edge-repeating paths between any pair of vertices. In fact, only a subset of the entire edge collection in $G_{JP}$ can be further employed in $T_{opt}$. Therefore, we propose two pruning conditions to effectively reduce the search space in this section.

¹ In our experiments we observe that the value of $\lambda$ falls in the interval (0.38, 0.46). We set $\lambda = 0.4$ as a constant.

The first intuition is that, when selecting $T_{opt}$, the case in which many join conditions are covered by multiple MRJs in $T_{opt}$ is not preferred, because each join condition only needs to be evaluated once. However, this does not imply that the MRJs in $T_{opt}$ should cover strictly disjoint sets of join conditions, because sometimes, by including extra join conditions, the output volume of intermediate results can be reduced. Therefore, we exclude an MRJ($e'_i$) only on the condition that there are other, more efficient ways to evaluate all the join conditions that MRJ($e'_i$) covers. Formally, we state the pruning condition in Lemma 1.

Lemma 1 Edge $e'_i$ should not be considered if there exists a collection of edges $ES$ such that the following conditions are satisfied: 1) $l'(e'_i) \subseteq \bigcup_{e'_j \in ES} l'(e'_j)$; 2) $w(e'_i) > \max_{e'_j \in ES} w(e'_j)$; 3) $s(e'_i) > \sum_{e'_j \in ES} s(e'_j)$.

Lemma 1 is quite straightforward. If an MRJ can be substituted with some other MRJs that cover at least the same join conditions and can be evaluated more efficiently with less demand on processing units, this MRJ cannot appear in $T_{opt}$: because $T_{opt}$ is the optimal collection of MRJs to evaluate the query, containing any substitutable MRJ would make $T_{opt}$ sub-optimal. For the second pruning method, we present the following lemma, which further reduces the search space.

Lemma 2 Given two edges $e'_i$ and $e'_j$, if $e'_i$ is not considered and $l'(e'_i) \subseteq l'(e'_j)$, then $e'_j$ should not be considered either.

Proof. Since $e'_i$ is not considered, there is a better solution to cover $l'(e'_i) \cap l'(e'_j)$. This solution can be employed together with $l'(e'_j) - l'(e'_i)$, which is more efficient than computing $l'(e'_j)$ in one step. Therefore, $e'_j$ should not be considered. □

Note that Lemma 2 is orthogonal to Lemma 1. Lemma 1 decides whether an MRJ should be considered as a member of $T_{opt}$; if the answer is negative, we can employ Lemma 2 to directly prune more undesired MRJs. By employing the two lemmas proposed above, we develop an algorithm to construct $G'_{JP}$ efficiently in an incremental manner, as presented in Alg. 2.
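The pruning test of Lemma 1 amounts to three comparisons between a candidate edge and a group of substitute edges. The sketch below is a minimal illustration: the `Edge` record and the function name are hypothetical, and reading the third quantity `s` as the number of processing units an MRJ demands is an assumption based on the explanation that follows the lemma.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Edge:
    """One candidate MRJ, i.e. one edge of the join-path graph."""
    conds: FrozenSet[str]  # l'(e'): join conditions covered by the MRJ
    w: float               # w(e'): estimated evaluation time
    s: int                 # s(e'): processing units demanded (assumed meaning)


def prunable_by_lemma1(e: Edge, es: Iterable[Edge]) -> bool:
    """True if the group `es` can substitute for `e` in the sense of Lemma 1."""
    es = list(es)
    if not es:
        return False
    covered = frozenset().union(*(x.conds for x in es))
    return (
        e.conds <= covered                 # 1) all conditions are covered
        and e.w > max(x.w for x in es)     # 2) every substitute is faster
        and e.s > sum(x.s for x in es)     # 3) substitutes need fewer units
    )
```

A caller would apply this check to each newly found path against groups of already accepted edges, and skip the path whenever the check succeeds.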

Algorithm 2: Constructing $G'_{JP}$

Data: $G_J$ containing $n$ vertices and $m$ edges, $G'_{JP} = \emptyset$, a sorted list $WL = \emptyset$;
Result: $G'_{JP}$
1  for i = 1:n do
2      for j > i do
3          for L = 1:m do
4              if there is an L-hop path from $R_i$ to $R_j$ then
5                  $e'$ ← the L-hop path from $R_i$ to $R_j$
6                  if $WL \neq \emptyset$ then
7                      scan $WL$ to find the first group of edges that cover $e'$
8                      apply Lemma 1 to decide whether to report $e'$ to $G'_{JP}$
9                      if $e'$ is not reported then
10                         break  // Lemma 2 plays the role
11                 insert $e'$ into $WL$ such that $WL$ maintains a sequence of edges in the ascending order of $w(e')$



Since we do not care about the direction of a path, meaning $e'(v_i, v_j) = e'(v_j, v_i)$, we compute the pair-wise join paths following a fixed order of vertices (relations). In Alg. 2, we employ a linear scan of a sorted list to help decide whether a path should be reported in $G'_{JP}$. One tricky part of the algorithm is line 4. A straightforward way is to run a DFS from a given starting vertex, which has time complexity $O(m + n)$. However, performing this task for every vertex introduces much redundant work. A better solution is to traverse $G_J$ once before running Alg. 2 and record the $L$-hop neighbors of every vertex, which takes only $O(m + n)$ time. Then, line 4 can be determined in $O(1)$ time. Overall, the worst-case time complexity of Alg. 2 is $O(n^2 m)$, which happens only when $G_J$ is a complete graph. In real practice, due to the sparsity of the graph, Alg. 2 is quick enough to generate $G'_{JP}$ for a given $G_J$. As observed in the experiments, $G'_{JP}$ can be generated within hundreds of microseconds.
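To see what the candidate paths look like, the following sketch enumerates all no-edge-repeating paths between two relations of a small join graph by plain backtracking. It is an exhaustive illustration without the pruning of Lemma 1 and Lemma 2, and the function name and edge representation are hypothetical.

```python
def edge_simple_paths(edges, src, dst):
    """Yield every no-edge-repeating path from src to dst.

    edges: list of (u, v, label) for an undirected join graph, where each
    label names one join condition. A path is returned as a list of labels.
    """
    adj = {}
    for idx, (u, v, _) in enumerate(edges):
        adj.setdefault(u, []).append((v, idx))
        adj.setdefault(v, []).append((u, idx))

    def dfs(node, used, path):
        if node == dst and path:
            yield list(path)
        # Keep exploring: a path may pass through dst and return to it later.
        for nxt, idx in adj.get(node, []):
            if idx not in used:
                used.add(idx)
                path.append(edges[idx][2])
                yield from dfs(nxt, used, path)
                path.pop()
                used.remove(idx)

    yield from dfs(src, set(), [])


# A triangle-shaped join graph over three relations.
TRIANGLE = [("R1", "R2", "theta1"), ("R2", "R3", "theta2"), ("R1", "R3", "theta3")]
```

On the triangle, `edge_simple_paths(TRIANGLE, "R1", "R2")` yields the direct path and the two-hop path through `R3`; on denser graphs the number of such paths grows quickly, which is why pruning matters.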

After $G'_{JP}$ is obtained, we select $T_{opt}$ following the methodology presented in [14], which gives an $O(\ln(n))$ approximation ratio of the optimum.
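The selection step can be pictured as a covering problem: pick MRJs so that every join condition is covered at low total cost. The sketch below shows the classic greedy heuristic for weighted set cover, which is known to have a logarithmic approximation ratio. Whether [14] uses exactly this procedure is not stated here, so this is an assumed stand-in; it also ignores the scheduling of the chosen MRJs, and the function name is hypothetical.

```python
import math


def greedy_cover(universe, candidates):
    """Greedy weighted set cover.

    universe: set of join conditions to cover.
    candidates: dict mapping an MRJ name to (covered conditions, cost).
    Returns the chosen MRJ names in selection order.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best, best_ratio = None, math.inf
        for name, (conds, cost) in candidates.items():
            gain = len(conds & uncovered)
            # Prefer the candidate with the lowest cost per newly covered condition.
            if gain and cost / gain < best_ratio:
                best, best_ratio = name, cost / gain
        if best is None:
            raise ValueError("some join conditions cannot be covered")
        chosen.append(best)
        uncovered -= candidates[best][0]
    return chosen
```

Each round picks the MRJ that covers the remaining conditions most cheaply per condition, so an MRJ that covers many conditions at a high cost may lose to a combination of cheaper ones.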

6. EXPERIMENTS 

To verify the effectiveness of our solution, we conduct experiments on a real cluster environment with both real and synthetic data sets. In this section, we first describe the setup of the test bed and the data sets we used. Then we validate our cost model. We compare our solution for multi-way Theta-join processing with YSmart [23], Hive and Pig. We demonstrate that our solution saves on average 30% of query processing time compared to the state-of-the-art methods. Especially for complex queries over huge volumes of data, our method achieves up to 150% speedup.

6.1 Experiments Setup 

Our experiments run exclusively on a 13-node cluster, where one node serves as the master node (Namenode). Every node has 2x Intel(R) Core(TM) i7 CPU 950, 2x Kingston DDR-3 1333MHz 4GB memory and a 2.5TB HDD attached, and runs 2.6.35-22-server #35-Ubuntu SMP. All the nodes are connected with a 10GB switch. In total, the test bed has 104 cores, 104GB main memory, and over 25TB storage capacity.



Parameter Name | Default | Set
fs.blocksize | 64MB | 64MB
io.sort.mb | 100MB | 512MB
io.sort.record.percentage | 0.05 | 0.1
io.sort.spill.percentage | 0.8 | 0.9
io.sort.factor | 100 | 300
dfs.replication | 3 | 3

Table 1: Hadoop parameter configuration

We use Hadoop-0.20.205.0 to set up the system. Some major Hadoop parameters are given in Table 1, which follows the setting suggested by [21]. We use the TestDFSIO program to test the I/O performance of the system, and find that the system performance is stable, with an average writing rate of 14.69Mb/sec and a reading rate of 74.26Mb/sec. We run each experiment job 10 times and report the average execution time.

The first data set we employ for experiments is a real-world data set collected from over 2000 mobile base stations from Oct 1st to Nov 30th, 2008. The data set records 571,687,536 phone calls from 2,113,968 different users. It contains 61 daily data files, 20GB in total. The data schema is very simple and contains the following five attribute fields:

id | d: date | bt: begin time | l: length | bsc: base station code

Figure 6: Execution time of sample Join task with different input size. (a) Input Size: 500GB (b) Input Size: 100GB (c) Input Size: 10GB (d) Input Size: 1GB

In the experiments, we design four queries of different complexities. We elaborate the workloads and complexity trend of the benchmark queries in Section 6.3.1. To validate the scalability of our solution, we enlarge the data set to 100GB and 500GB by generating more phone calls, following the distribution of the number of phone calls over a day, which is a diurnal pattern (a periodic function with 24-hour cycles).

The second data set we employ is a synthetic but well recognized data set that is specially designed for the TPC-H benchmark queries. We use the free tool DBGEN [1] to generate testing data sets of different sizes. We test almost all of the 21 benchmark queries that have multi-way join conditions. In this section we present the results of Q7, Q17, Q18 and Q21 to demonstrate the effectiveness of our solution, as they are well recognized benchmark queries for testing how complex queries are evaluated. Since some queries only involve Equi-join, we slightly amend the join predicates to add inequality join conditions.

In the experiments, we compare our solution with YSmart, Hive and Pig. For the mobile data set, we develop the Hive and Pig scripts ourselves. For the TPC-H test, we adopt the Hive code from an open report² and develop efficient Pig scripts.

6.2 Cost Model Validation 

The major factors affecting the performance of an MRJ are: 1) system parameter settings; 2) input size and data set properties, especially the value distributions of join attributes; 3) the number of Reduce tasks. For the first factor, as elaborated in Section 4, we use random variable $p$ to denote the speed of spilling data to disk under different Map output ratios, and random variable $q$ to represent the cost of handling network connections for different numbers of Reduce tasks. We can predict the second factor by running a data sampling algorithm (we conduct this task after the data are uploaded to HDFS). To decide $k_R$, a proper number of Reduce tasks, we can obtain a theoretical value of $k_R$ that guarantees the minimum execution time span by minimizing the cost formula (10). We validate the prediction of $k_R$ with experiments and find that the optimal $k_R$ is mainly dominated by the output volume of Map tasks ($\lambda \approx 0.4$).

By studying the execution times of sample MRJs configured with different numbers of Reduce tasks, we obtain some insightful guidelines for selecting a proper $k_R$. We run a sample MRJ conducting the join operation, which is included in Hadoop's standard release. We test different Map output sizes (1~200GB) and different values of $k_R$ (2~64). The results are shown in Fig. 6. We find that, for an MRJ with large inputs, significant performance gains are obtained by increasing $k_R$ at the very beginning, as shown in Fig. 6(a). However, as $k_R$ keeps growing, the performance gains become smaller and smaller. This phenomenon can be clearly observed in all four sub-figures of Fig. 6. For an MRJ with a relatively small input size, we see a clear inflection point of performance as $k_R$ grows, as observed in Fig. 6(b), Fig. 6(c) and Fig. 6(d). Thus, we obtain a correlation between the input size (with the Map output size determined) and the $k_R$ that gives the best performance, as shown in Figure 7(a). We find that our experiment results are well matched by a fitting curve (dashed line). We use this curve to determine $k_R$ for a given MRJ, such that we can compute the distributions of $p$ and $q$, which serve the estimation of an MRJ's running time. We compute $p$ and $q$ by studying an output-controllable self-join program over a synthetic data set. Figure 7(b) gives the distributions of $p$ and $q$ according to different problem sizes.

² http://issues.apache.org/jira/secure/attachment/12416257/TPC-H_on_Hive_2009-08-11.pdf
Figure 7: Selection of $k_R$, $p$ and $q$

To validate the effectiveness of our cost model, we checked the same self-join program over the mobile data set. As shown in Fig. 8, our estimation and the real MRJ execution time are very close.

Figure 8: Cost model validation with a self-join program

6.3 Query Evaluation 

Compared to YSmart and Hive, which target tables stored in the Hive data warehouse, our solution targets plain data files stored in the Hadoop Distributed File System (HDFS). In addition to simply uploading the data to HDFS, we run a sampling algorithm to collect rough data statistics and build the index structure, which is the reason that our method is a little more time consuming in the data uploading process, as shown in Fig. 11. For comparison, we also present the cost of simply uploading data files to HDFS. Note that the uploading is performed by each DataNode from its local disk. Compared with Hive, our method demonstrates comparable time cost of data uploading for large data volumes.

Figure 9: Execution time of 4 queries over the mobile data set in different scales, $k_P \leq 96$

Figure 10: Execution time of 4 queries over the mobile data set in different scales, $k_P \leq 64$

Figure 11: The time cost for data loading

6.3.1 Real World Mobile Data

We design four multi-way Theta-join queries for the mo- 
bile data set, which are of different complexities in terms 
of covering different inequality functions and joining on at- 
tributes with different selectivities. We test benchmark quer- 
ies on different scales of data volumes to validate the scala- 
bility of our solution. In comparison, we also test YSmart, 
Hive and Pig scripts that perform the same tasks. 

For comprehensiveness, we describe the four queries in a SQL-like style as follows:

Q1: SELECT t1.id FROM table t1, table t2, table t3 WHERE t1.bt < t2.bt, t1.l > t2.l, t2.bsc = t3.bsc, t2.d = t3.d
Q2: SELECT t3.id FROM table t1, table t2, table t3 WHERE t1.bt < t2.bt, t1.l > t2.l, t2.bsc ≠ t3.bsc, t2.d = t3.d
Q3: SELECT t1.id FROM table t1, table t2, table t3, table t4 WHERE t1.d < t2.dt, t2.d < t3.dt, t1.d+3 > t3.d, t1.bsc = t4.bsc
Q4: SELECT t1.id FROM table t1, table t2, table t3, table t4 WHERE t1.d < t2.dt, t2.d < t3.dt, t1.d+3 > t3.d, t1.bsc ≠ t4.bsc

In plain English, the first two queries return the concurrent phone calls at the same base station and all concurrent phone calls at different base stations, respectively. The third query returns the users whose calls are handled by the same base station three days in a row. The fourth query finds the users whose calls are handled by different base stations three days in a row. These queries can help monitor the workload distribution among base stations and capture unusual customer behavior. Table 2 summarizes the features of the benchmark queries. Note that the four queries are listed in ascending order of running time complexity. As shown in the table, the benchmark queries cover all the inequality functions and differ significantly in output size.



Q | Relations Cnt. | Inequality Func. | Join Cnt. | Result Sel.
Q1 | All | {<, >} | 3 | 0.00035
Q2 | All | {<, >, ≠} | 3 | 0.00108
Q3 | All | {<, >} | 4 | 0.00079
Q4 | All | {<, >, ≠} | 4 | 0.01524

Table 2: Benchmark query statistics

As we elaborate in Section 3, there may not be enough processing units to evaluate queries in the most time-saving fashion. Therefore, we test the benchmark queries with different numbers of available processing units, as shown in Fig. 9 and Fig. 10, respectively. The results in Fig. 9 demonstrate that our solution has time cost comparable to the state-of-the-art method YSmart. Especially when the query is relatively easy, like Q1 and Q2, our solution at best approaches YSmart's performance. The reasons are twofold. First, for simple queries, there is little optimization opportunity for MRJ scheduling. Second, YSmart takes multiple inter-MRJ optimization techniques into consideration, which is not the focus of our work. In this case, compared to Hive and Pig, the time saving of our solution lies in eliminating unnecessary network volume and redundant Reduce task workloads.

When we specify $k_P$ (the number of available processing units) to be at most 64, the advantage of our solution for more complex queries is obvious. As shown in Fig. 10, taking Q4 for instance, our solution achieves about 50% time savings compared to YSmart.

Figure 12: Execution time of 4 TPC-H benchmark queries in different scales

Figure 13: Execution time of 4 TPC-H benchmark queries in different scales, $k_P \leq 64$

6.3.2 TPC-H Benchmark Queries

We test almost all 21 benchmark queries from the TPC-H benchmark, excluding some simple queries that only involve two or three relations and simply join on foreign keys, like Q1 and Q2. In this section we present the results of 4 queries, which are well recognized complex queries for performance tests. In the experiments, we also run the queries under different numbers of available processing units. The results are presented in Fig. 12 and Fig. 13. Table 3 summarizes the features of the 4 benchmark queries.



Q | Relations Cnt. | Inequality Func. | Join Cnt. | Result Sel.
Q7 | 5 | {<, >} | 8 | 0.00176
Q17 | 3 | {<} | 4 | 0.00426
Q18 | 4 | {>} | 4 | 0.00021
Q21 | 6 | {>, ≠} | 8 | 0.00087

Table 3: TPC-H query statistics

When all processing units are involved in the evaluation, as reported in Fig. 12, we have the following observations. First, as reported in [23], YSmart generally achieves over 200% speedup compared to Hive. Second, by taking advantage of index structures and data statistics, our solution for multi-way Theta-join queries achieves 30% time savings on average compared to YSmart. The reason is that our solution tries to minimize the data copying volume over the network and balance the workload of Reduce tasks. Third, when the number of processing units is sufficient, i.e., when the involved data volume is relatively small, our method gains more time savings by taking advantage of the "greedy" scheduling, as shown in Fig. 12(b) and Fig. 12(c). Moreover, as the data set volume increases, our solution also demonstrates satisfactory scalability, as Hive does.

When we set $k_P$ to a smaller value, e.g. $k_P \leq 64$, Fig. 13 shows that our method achieves even more time savings than with a larger $k_P$ ($k_P \leq 96$). For instance, as shown in Fig. 13(a) and Fig. 13(d), as the underlying data volumes grow, our method demonstrates better scalability. Since our solution employs $k_P$-aware scheduling of MRJs, when $k_P$ changes, the selection of $T$ and the execution plan are updated correspondingly. On the contrary, Hive always tries to employ as many Reduce tasks as possible to perform a task, and YSmart does not take this factor into consideration. Therefore, we observe as much as 150% speedup compared to the YSmart solution.

In summary, as expected and confirmed by experiments, our solution outperforms the state-of-the-art solutions in two aspects: 1) when there are not enough processing units, our solution is able to dynamically choose a near-optimal plan to minimize the evaluation makespan; 2) our solution takes advantage of data statistics and index structures to guide the (key, value) partition among Reduce tasks. On one hand, we eliminate unnecessary data copying when performing a Theta-join query. On the other hand, we minimize the redundant computation in Reduce tasks. Therefore, in the context of fitting multi-way Theta-join evaluation in a dynamic Cloud computing platform, our solution demonstrates promising scalability and execution efficiency.

7. RELATED WORK 

Existing efforts toward efficient join query evaluation us- 
ing MapReduce mainly fall into two categories. The first 
category is to implement different types of join queries by 
exploring the partition of (key, value) pairs from Map tasks 
to Reduce tasks without touching the implementation de- 
tails of the MapReduce framework. The second category is 
to improve the functionality and efficiency of MapReduce 
itself to achieve better query evaluation performance. For 
example, MapReduce Online [9] allows pipelined job inter- 
connections to avoid intermediate result materialization. A 
PACT model [4] extends the MapReduce concept for com- 
plex relational operations. Our work, as well as work [27] 
on set similarity join, work [25] on Theta-join, all fall in the 
first category. We briefly survey some most related works in 
this category. 

F.N.Afrati and at el. [2] present their novel solution for 
evaluating multi-way Equi-join in one MR J. The essential 
idea is that, for each join key, they logically partition the 
Reduce tasks into different groups such that a valid join re- 
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suit can be discovered on at least one Reduce task. Their 
optimization goal is to minimize the volume of data copying 
over the network. But the solution only works for the Equi- 
join scenario. Because for Equi-join, as long as we make the 
join attribute the partition key, the joinable data records 
that have the same key value will be delivered to the same 
Reduce task. However, for Theta-join queries, such par- 
tition method for (key, value) pairs cannot even guarantee 
the correctness. Moreover, answering complex join queries 
with one MRJ may not guarantee the best time efficiency 
in practice. Wu Sai and et al. [28] targets at the efficient 
processing of multi-way join queries over massive volume of 
data. Although they present their work in the context of 
Equi-join, their focus is how to decompose a complex query 
to multiple MRJs and schedule them to eventually evaluate 
the query as fast as possible. However, their decomposition 
is still join-key oriented. Therefore, after decomposing the 
original query into multiple pair-wise joins, how to select the 
optimal join order is the main problem. In contrast, 
although we also explore the scheduling of MRJs in this 
work, each MRJ being scheduled can involve multiple 
relations and multiple join conditions. Therefore, our 
solution tries to explore all possible evaluation plans. 
Moreover, the work in [28] does not take the limited number 
of processing units into consideration, which is a critical 
issue in practice. Other works explore the general workflow 
of a single MRJ or multiple MRJs to improve overall throughput. 
Hadoop++ [13] injects optimized UDFs into Hadoop to im- 
prove query execution performance. RCFile [17] provides a 
column-wise data storage structure to improve I/O perfor- 
mance in MapReduce-based warehouse systems. MRShare 
[24] explores the optimization opportunities to share the 
file scan and partition key distribution among multiple cor- 
related MRJs. YSmart [23] is a source-to-source SQL to 
MapReduce translator. It proposes a common-MapReduce 
framework to reduce redundant file I/O and duplicated 
computation among Reduce tasks. Recent systems such as SCOPE 
[29] and ES2 [7] present query optimization and data 
organization solutions that can avoid high-cost data 
re-partitioning when executing a complex query plan. 
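
To illustrate the point made above about key-based partitioning, the 
following is a minimal sketch, assuming a toy in-memory simulation of a 
reduce-side join. The function names (partitioned_join, full_join) and the 
modulo-based routing are illustrative assumptions; they are not taken from 
[2] or from any MapReduce API. 

```python
from collections import defaultdict
from itertools import product
import operator


def partitioned_join(R, S, num_reducers, predicate):
    """Simulate one job that routes each record by (key mod num_reducers).

    Hypothetical helper: each "Reduce task" only sees the records routed
    to it and joins them locally with the given predicate.
    """
    buckets = defaultdict(lambda: ([], []))
    for r in R:
        buckets[r % num_reducers][0].append(r)
    for s in S:
        buckets[s % num_reducers][1].append(s)

    result = set()
    for r_part, s_part in buckets.values():
        for r, s in product(r_part, s_part):
            if predicate(r, s):
                result.add((r, s))
    return result


def full_join(R, S, predicate):
    """Reference answer: compare every pair of the cross product."""
    return {(r, s) for r, s in product(R, S) if predicate(r, s)}


R = [1, 2, 3, 4]
S = [2, 3, 4, 5]

# Equi-join: equal keys always meet on the same reducer.
equi_part = partitioned_join(R, S, 2, operator.eq)
equi_full = full_join(R, S, operator.eq)

# Theta-join (r < s): matching pairs may be routed to different reducers.
theta_part = partitioned_join(R, S, 2, operator.lt)
theta_full = full_join(R, S, operator.lt)

print(equi_part == equi_full)             # True
print(len(theta_part), len(theta_full))   # 4 10
```

In this toy setting the Equi-join result is complete, while the key-partitioned 
Theta-join finds only 4 of the 10 qualifying pairs, because pairs such as 
(1, 2) are never co-located on one Reduce task. 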

8. CONCLUSION 

In this paper, we focus on the efficient evaluation of multi- 
way Theta-join queries using MapReduce. Our solution in- 
cludes two parts. First, we study how to conduct a chain- 
type multi-way Theta-join using one MapReduce job. We 
propose a Hilbert-curve-based space partition method that 
minimizes data copying volume over the network and balances 
the workload among Reduce tasks. Second, we propose a 
resource-aware scheduling scheme that helps the evaluation 
of complex join queries achieve near-optimal time efficiency 
in resource-restricted scenarios. In extensive experiments 
over both synthetic and real-world data, our solution 
demonstrates promising query evaluation efficiency compared 
with state-of-the-art solutions. 